In the next few subsections, we dive into the details of how we drew these conclusions.
3.2 Data pre-processing
To conduct deeper analysis on our data, we added the following columns to our dataset:
BMI: the Body Mass Index of each person, a continuous variable calculated using the World Health Organization guideline: \(BMI = \frac{Weight}{Height^2}\).
IsOverweight: a binary categorical value whose value is 1 if the BMI of the person is more than 25, 0 otherwise.
These columns serve as alternate metrics of the obesity of an individual, a continuous variable and a binary categorical variable, to complement NObeyesdad, the obesity category level which is a ordinal categorical variable.
These new columns can be found and used directly in the dataset scripts/xinyi-zhao_files/CleanObesityDataSet.csv.
In order to improve the visualization of certain plots, we also made local modifications to the data types of some columns for only that given plot. For instance, to plot the alluvial diagrams (figure 10), we rounded the values of the following: Frequency of consumption of vegetables (FCVC), Number of main meals (NCP), Physical activity frequency (FAF), Time using technology devices (TUE) (some of the values of these variables were floats, presumably because the researchers took an average of multiple time periods). This made sure that the variables are discrete (whole numbers).
3.3 Analysis of Height, Weight and BMI variables
Since the dataset contains artificially generated data, we first needed to verify that the dataset matches our expectations, notably by visualizing the distribution of these three continuous variables.
Preliminary verifications of the number of rows, columns and missing values of our dataset show that there does not seem to be any problems with the structure of our dataset.
Code
cat("Number of rows in dataset: ", nrow(data), "\nNumber of columns in dataset: ", ncol(data),"\nNumber of missing values in dataset: ", sum(is.na(data)))
Number of rows in dataset: 2111
Number of columns in dataset: 20
Number of missing values in dataset: 0
Additionally, the data seems to align with our expectations, visualizing the Height, Weight and BMI variables shows that there are no biases with respect to the people surveyed.
Code
# Histograms of Height, Weight and BMIplot1 <-ggplot(data = data, aes(x = Height)) +geom_histogram(binwidth =0.02)plot2 <-ggplot(data = data, aes(x = Weight)) +geom_histogram(binwidth =2)plot3 <-ggplot(data = data, aes(x = BMI)) +geom_histogram(binwidth =1)grid.arrange(plot1, plot2, plot3, ncol =3,top =textGrob("Histograms of the continuous variables Height, Weight and BMI",gp=gpar(fontsize=16)))
Fig. 1: Histograms of Height, Weight and BMI
Code
shapiro.test(data$Height)
Shapiro-Wilk normality test
data: data$Height
W = 0.99323, p-value = 2.772e-08
Code
shapiro.test(data$Weight)
Shapiro-Wilk normality test
data: data$Weight
W = 0.9765, p-value < 2.2e-16
Code
shapiro.test(data$BMI)
Shapiro-Wilk normality test
data: data$BMI
W = 0.97475, p-value < 2.2e-16
The weight of the individuals does not appear to be normally distribution as it has a right skew. Body weight is found to have a right skew according to research.
The BMI of individuals
Lastly, we plotted a boxplot to better understand the other statistics (e.g. median, outliers) of the variable BMI.
Code
# Boxplot to visualize the BMI across all individuals ggplot(data, aes(x=BMI)) +geom_boxplot(fill ="lightblue", color ="black") +coord_cartesian(ylim =c(-2, 2)) +labs(title ="Boxplot of BMI across all individuals")
Fig. 3: Boxplot to visualize the BMI across all individuals
Code
fivenum(data$BMI)
[1] 12.99868 24.32580 28.71909 36.01650 50.81175
Code
ggplot(data, aes(x = Weight, y = NObeyesdad, fill = NObeyesdad)) +stat_density_ridges(alpha =0.7) +scale_fill_viridis_d() +labs(title ="Ridgeline Plot of Obesity Types", x ="weight", y ="Obesity Type") +theme_ridges()
Fig. 4: Ridgeline Plot of weight of individuals across Obesity Types
3.4 Preliminary visualizations
Code
set.seed(123456)data <- data %>%mutate_if(is.numeric, scale)data_num <- data[c("Age", "Height", "Weight", "NCP", "CH2O", "FAF", "TUE","BMI")]# Convert all columns to numericdata_num <-sapply(data_num, as.numeric)# Sample 1000 points for clearer visualizationdata_sampled <-sample_n(as.data.frame(data_num), 300)ggpairs(as.data.frame(data_sampled), aes(alpha =0.4, size =0.02), upper =list(continuous =wrap("cor", size =12))) +theme(text =element_text(size =35))
Fig. 5: Scatterplot of each pair of numeric variables in the dataset
Code
categories <-c('Insufficient_Weight','Normal_Weight', 'Overweight_Level_I', 'Overweight_Level_II', 'Obesity_Type_I', 'Obesity_Type_II','Obesity_Type_III')# Create a mapping from category to integercategory_mapping <-setNames(1:length(categories), categories)data$NObesity_numeric <-as.integer(factor(data$NObeyesdad, levels = categories))data_pca <- data[c("FCVC","NCP","FAF","TUE","NObesity_numeric")]cor_matrix <-cor(data_pca)# Reshape the data for heatmapmelted_cor <-melt(cor_matrix)# Creating heatmapggplot(melted_cor, aes(Var1, Var2, fill = value)) +geom_tile() +geom_text(aes(label =ifelse(value >0.1&value<1, round(value, 2), '')), color='darkblue',vjust =1) +scale_fill_gradient2(midpoint =0) +theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1)) +ggtitle("Heatmap of Correlations")
Fig. 6: Heatmap of correlations between numerical variables
Code
# Creating individual density plots with smooth linesplot_fcvc <-ggplot(data, aes(x = FCVC)) +stat_density(aes(y = ..density..), geom ="area", fill="blue", alpha=0.5) +ggtitle("FCVC Density Plot") +xlab("FCVC") +ylab("Density")plot_ncp <-ggplot(data, aes(x = NCP)) +stat_density(aes(y = ..density..), geom ="area", fill="blue", alpha=0.5) +ggtitle("NCP Density Plot") +xlab("NCP") +ylab("Density")plot_faf <-ggplot(data, aes(x = FAF)) +stat_density(aes(y = ..density..), geom ="area", fill="blue", alpha=0.5) +ggtitle("FAF Density Plot") +xlab("FAF") +ylab("Density")plot_tue <-ggplot(data, aes(x = TUE)) +stat_density(aes(y = ..density..), geom ="area", fill="blue", alpha=0.5) +ggtitle("TUE Density Plot") +xlab("TUE") +ylab("Density")# Arrange the plots in a 2x2 gridgrid.arrange(plot_fcvc, plot_ncp, plot_faf, plot_tue, nrow =2)
Fig. 7: Density plots for Frequency of consumption of vegetables (FCVC), Number of main meals (NCP), Physical activity frequency (FAF), Time using technology devices (TUE)
3.5 Understanding the correlation between biological factors and obesity
3.6 Correlation between lifestyle habits and obesity
Code
# Create a mosaic plot for each variable in the table plotted against 'IsOverweight'data_mosaic <- data[c("Gender", "family_history_with_overweight","FAVC","SMOKE","SCC","IsOverweight")]data_mosaic$family_history_with_overweight <-fct_relevel(data_mosaic$family_history_with_overweight, "yes", "no")data_mosaic$SMOKE <-fct_relevel(data_mosaic$SMOKE, "yes", "no")data_mosaic$FAVC <-fct_relevel(data_mosaic$FAVC, "yes", "no")data_mosaic$SCC <-fct_relevel(data_mosaic$SCC, "yes", "no")data_mosaic$IsOverweight <-as.character(data_mosaic$IsOverweight)for(var innames(data_mosaic)[names(data_mosaic) !="IsOverweight"]) {# png(paste0("mosaic_", var, "_IsOverweight.png")) contingency_table <-table(data_mosaic[[var]],data_mosaic$IsOverweight)mosaicplot(contingency_table, main =paste("Mosaic Plot for", var), ylab ="Is Overweight", xlab = var, col =c("pink", "brown"))# dev.off()}
Fig. 8a: Mosaic Plot of Gender against IsOverweight
Fig. 8b: Mosaic Plot of Family History against IsOverweight
Fig. 8c: Mosaic Plot of Frequent consumption of high caloric food against IsOverweight
Fig. 8d: Mosaic Plot of weight of whether the person smokes against IsOverweight
Fig. 8e: Mosaic Plot of calories consumption monitoring against IsOverweight